Sample size requirements for training high-dimensional risk predictors.
Abstract
A common objective of biomarker studies is to develop a predictor of patient survival outcome. Determining the number of samples required to train a predictor from survival data is important for designing such studies. Existing sample size methods for training studies use parametric models for the high-dimensional data and cannot handle a right-censored dependent variable. We present a new training sample size method that is non-parametric with respect to the high-dimensional vectors, and is developed for a right-censored response. The method can be applied to any prediction algorithm that satisfies a set of conditions. The sample size is chosen so that the expected performance of the predictor is within a user-defined tolerance of optimal. The central method is based on a pilot dataset. To quantify uncertainty, a method to construct a confidence interval for the tolerance is developed. Adequacy of the size of the pilot dataset is discussed. An alternative model-based version of our method for estimating the tolerance when no adequate pilot dataset is available is presented. The model-based method requires a covariance matrix be specified, but we show that the identity covariance matrix provides adequate sample size when the user specifies three key quantities. Application of the sample size method to two microarray datasets is discussed.
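The abstract does not spell out the estimation procedure, so the following is only a rough illustration of the general idea of picking a training sample size so that expected performance is within a user-defined tolerance of optimal. It is not the authors' method. It assumes, purely for illustration, that error estimates at a few training sizes (for example, from subsampling a pilot dataset) follow an inverse power law err(n) = a + b * n^(-c), where a stands in for the optimal error. The function names, the exponent grid, and the inverse power law itself are all hypothetical choices.

```python
import math


def fit_inverse_power_curve(sizes, errors, exponents=None):
    """Fit err(n) ~ a + b * n**(-c) to learning-curve points.

    Assumption (not from the paper): an inverse power law describes how
    the error decays with training size. For each candidate exponent c
    the model is linear in (a, b), so simple least squares is used, and
    the exponent with the smallest squared error is kept.
    """
    if exponents is None:
        # Hypothetical search grid for the decay exponent: 0.05, 0.10, ..., 2.00
        exponents = [0.05 * k for k in range(1, 41)]
    best = None
    mean_y = sum(errors) / len(errors)
    for c in exponents:
        x = [n ** (-c) for n in sizes]
        mean_x = sum(x) / len(x)
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, errors))
        b = sxy / sxx              # slope on the transformed size
        a = mean_y - b * mean_x    # intercept = fitted asymptotic error
        sse = sum((a + b * xi - yi) ** 2 for xi, yi in zip(x, errors))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    return best[1], best[2], best[3]


def required_sample_size(b, c, tolerance):
    """Smallest n whose fitted excess error b * n**(-c) is <= tolerance."""
    if tolerance <= 0:
        raise ValueError("tolerance must be positive")
    if b <= 0:
        return 1  # fitted curve does not decrease; no extra samples indicated
    return max(1, math.ceil((b / tolerance) ** (1.0 / c)))
```

A possible usage: pass the training sizes and the corresponding error estimates to `fit_inverse_power_curve`, then call `required_sample_size(b, c, tolerance)` with the tolerance of interest. Right-censoring, the confidence interval for the tolerance, and the model-based variant mentioned in the abstract are not addressed by this sketch.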
Similar resources
Robust high-dimensional semiparametric regression using optimized differencing method applied to the vitamin B2 production data
Background and purpose: By evolving science, knowledge, and technology, we deal with high-dimensional data in which the number of predictors may considerably exceed the sample size. The main problems with high-dimensional data are the estimation of the coefficients and interpretation. For high-dimension problems, classical methods are not reliable because of a large number of predictor variable...
A Comparison of Methods for Group Prediction with High Dimensional Data
High dimensional data is the situation in which the number of variables included in an analysis approaches or exceeds the sample size. In the context of group classification, researchers are typically interested in finding a model that can be used to correctly place an individual into their appropriate group; e.g. correctly diagnose individuals with depression. However, when the size of the tra...
Estimating Sufficient Reductions of the Predictors in Abundant High-dimensional Regressions, by R. Dennis Cook and Liliana Forzani
We study the asymptotic behavior of a class of methods for sufficient dimension reduction in high-dimension regressions, as the sample size and number of predictors grow in various alignments. It is demonstrated that these methods are consistent in a variety of settings, particularly in abundant regressions where most predictors contribute some information on the response, and oracle rates are ...
Sample size planning for developing classifiers using high-dimensional DNA microarray data.
Many gene expression studies attempt to develop a predictor of pre-defined diagnostic or prognostic classes. If the classes are similar biologically, then the number of genes that are differentially expressed between the classes is likely to be small compared to the total number of genes measured. This motivates a two-step process for predictor development, a subset of differentially expressed ...
Practice Appointment Rates for High-risk Asthmatics: What could be the Predictor(s)?
Practice appointment rates could have a significant impact on national health care costs and services offered by doctors. In this respect a study was designed to determine the relationship between practice appointments and possible predictors in high-risk asthmatics. An observational retrospective analysis of the predictors for the practice appointments in asthmatic patients with at least one h...
Journal title:
- Biostatistics
Volume 14, Issue 4
Publication date: 2013